Abstract

We review a recent trend in computational systems biology which aims at using pattern 
recognition algorithms to infer the structure of large-scale biological networks from heteroge- 
neous genomic data. We present several strategies that have been proposed and that lead to 
different pattern recognition problems and algorithms. The strength of these approaches is il- 
lustrated on the reconstruction of metabolic, protein-protein and regulatory networks of model 
organisms. In all cases, state-of-the-art performance is reported. 

1 Introduction 

In this review chapter we focus on the problem of reconstructing the structure of large-scale biological 
networks. By biological networks we mean graphs whose vertices are all or a subset of the genes and 
proteins encoded in a given organism of interest, and whose edges, either directed or undirected, 
represent various biological properties. As running examples we consider the three following graphs, 
although the methods presented below may be applied to other biological networks as well. 

• Protein-protein interaction (PPI) network. This is an undirected graph with no self-loop, that 
contains all proteins encoded by an organism as vertices. Two proteins are connected by an 
edge if they can physically interact. 

• Gene regulatory network. This is a directed graph that contains all genes of an organism as 
vertices. Among the genes, some called transcription factors (TFs) regulate the expression of 
other genes through binding to the DNA. The edges of the graph connect TFs to the genes 
they regulate. Self-loops are possible if a TF regulates itself. Moreover each edge may in 
principle be labeled to indicate whether the regulation is a positive (activation) or negative 
(inhibition) regulation. 

• Metabolic network. This graph contains only a subset of the genes as vertices, namely those 
coding for enzymes. Enzymes are proteins whose main function is to catalyse a chemical re- 
action, transforming substrate molecules into product molecules. Two enzymes are connected 
in this graph if they can catalyse two successive reactions in a metabolic pathway, i.e., two
reactions such that the main product of the first one is a substrate of the second one. 

Deciphering these networks for model organisms, pathogens or human is currently a major challenge 
in systems biology, with many expected applications ranging from basic biology to medical applica- 
tions. For example, knowing the detailed interactions possible between proteins on a genomic scale 
would highlight key proteins that interact with many partners, which could be interesting drug tar- 
gets [21], and would help in the annotation of proteins by annotation transfer between interacting 
proteins. The elucidation of gene regulatory networks, especially in bacteria and simple eukaryotes, 
would provide new insights into the complex mechanisms that allow an organism to regulate its 
metabolism and adapt itself to environmental changes, and could provide interesting guidelines for 
the design of new functions. Finally, understanding in detail the metabolism of an organism, and 
clarifying which proteins are in charge of its control, would give a valuable description of how or- 
ganisms have found original pathways for degradation and synthesis of various molecules, and could 
help again in the identification of new drug targets [28].

Decades of research in molecular biology and genetics have already provided a partial view of 
these networks, in particular for model organisms. Moreover, recent high-throughput technologies,
such as yeast two-hybrid systems for PPI, provide large numbers of likely edges in these graphs,
although probably with a high rate of false positives [HH [19]. Thus, much work remains to be 
done in order to complete (adding currently unknown edges) and correct (removing false positive 
edges) these partially known networks. To do so, one may want to use information about individual 
genes and proteins, such as their sequence, structure, subcellular localization, or level of expression 
across several experiments. Indeed, this information often provides useful hints about the presence 
or absence of edges between two proteins. For example, two proteins are more likely to interact 
physically if they are expressed in similar experiments, and localized in the same cellular compart- 
ment; or two enzymes are more likely to be involved in the same metabolic pathway if they are 
often co-expressed, and if they have homologs in the same species [25, 30, 20].

Following this line of thought, many approaches have been proposed in recent years to infer
biological networks from genomic and proteomic data, most of them attempting to reconstruct the 
graphs de novo. In de novo inference, the data about individual genes and proteins are given, 
and edges are inferred from these data only, using a variety of inference principles. For example, 
when time series of expression data are used, regulatory networks have been reconstructed by fitting 
various dynamical system equations to the data [H HH [37j dU dUJ EJ [2]. Bayesian networks have 
also been used to infer de novo regulatory networks from expression data, assuming that direct 
regulation can be inferred from the analysis of correlation and conditional independence between 
expression levels [15] . Another rationale for de novo inference is to connect genes or proteins that are 
similar to each other in some sense (25j EU] , For example, co-expression networks, or the detection of 
similar phylogenetic profiles are popular ways to infer "functional relationships" between proteins, 
although the meaning of the resulting edges has no clear biological justification [36] . Similarly, some 
authors have attempted to predict gene regulatory networks by detecting large mutual information 
between expression levels of a TF and the genes it regulates [9, 14].

In contrast to these de novo methods, in this review we present a general approach to reconstruct 
biological networks using information about individual genes and proteins, based on supervised 
machine learning algorithms, as developed through a recent series of articles [45l l43l l46l l3l IE1 l42l l8l 
[27] . The graph inference paradigm we follow assumes that, besides the information about individual 
vertices (genes or proteins) used by de novo approaches, the graph we wish to infer is also partially 
known, and known edges can be used by the inference algorithm to infer unknown edges. This 
paradigm is similar to the notion of supervised inference in statistics and machine learning, where 
one uses a set of input/output pairs (often called the training set) to estimate a function that can
predict the output associated to new inputs [17, 7]. In our paradigm, we allow ourselves to use
the known edges of the graph to supervise the estimation of a function that could predict whether 
a new pair of vertices is connected by an edge or not, given the data about the vertices. Intuitively, 
this setting can allow us to automatically learn what features of the data about vertices are the 
most informative to predict the presence of an edge between two vertices. In a sense, this paradigm 
leads to a problem much simpler than the de novo inference problem, since more information is 
used as input, and it might seem unfair to compare de novo and supervised methods. However, 
as already mentioned, in many real-world cases of interest we already partially know the graph we 
wish to infer. It is therefore quite natural to use as much information as we can in order to focus 
on the real problem, which is to infer new edges (and perhaps delete wrong edges), and therefore to 
use as input both the genomic and proteomic data, on the one hand, and the edges already known, 
on the other hand. 

In a slightly more formal language, we therefore wish to learn a function that can predict whether 
an edge exists or not between two vertices (genes or proteins), given data about the vertices (e.g., 
expression levels of each gene in different experimental conditions). Technically this problem can 
be thought of as a problem of binary classification, where we need to assign a binary label (presence 
or absence of edge) to each pair of vertices, as explained in Section 2.1. From a computational
point of view, the supervised inference paradigm we investigate can in principle benefit from the
availability of a number of methods for supervised binary classification, also known as pattern
recognition [7]. These methods, as reviewed in Section 2.2 below, are able to estimate a function
to predict a binary label from data about patterns, given a training set of (pattern, label) pairs.
The supervised inference problem we are confronted with, however, is not a classical pattern/label
problem, because the data are associated to individual vertices (e.g., expression profiles are available
for each individual gene), while the labels correspond to pairs of vertices. Before applying
state-of-the-art machine learning algorithms out of the box, we therefore need to clarify how our
problem can be recast as a classical pattern recognition problem (Section 2.3). In particular, we show
that there is no unique way to do that, and present in Sections 2.4 and 2.5 two classes of approaches
that have been proposed recently. Both classes involve a support vector machine (SVM) as binary
classification engine, but follow different avenues to cast the edge inference problem as a binary
classification problem. In Section 3 we provide experimental results that justify the relevance
of supervised inference, and show that a particular approach, based on local models, performs
particularly well on the reconstruction of PPI, regulatory and metabolic networks. We conclude
with a brief discussion in the final section.

2 Graph reconstruction as a pattern recognition problem 

In this section we define formally the graph reconstruction problem considered, and explain how to 
solve it with pattern recognition techniques. 

2.1 Problem formalization 

We consider a finite set of vertices $V = (v_1, \ldots, v_n)$ that typically correspond to the set of all genes
or proteins of an organism. We further assume that for each vertex $v \in V$ we have a description of
various features of $v$ as a vector $\phi(v) \in \mathbb{R}^p$. Typically, $\phi(v)$ could be a vector of expression levels
of the gene $v$ in $p$ different experimental conditions, measured by DNA microarrays, a phylogenetic
profile which encodes the presence or absence of the gene in a set of $p$ sequenced genomes [30], a
vector of $p$ sequence features, or a combination of such features. We wish to reconstruct a set of edges
$E \subset V \times V$ that defines a biological network. While in de novo inference the goal is to design an
algorithm that automatically predicts edges in $E$ from the set of vertex features $(\phi(v_1), \ldots, \phi(v_n))$,
in our approach we further assume that a set of pairs of vertices known to be connected by an
edge or not is given. In other words we assume given a list $S = ((e_1, y_1), \ldots, (e_N, y_N))$ of pairs of
vertices ($e_i \in V \times V$) tagged with a label $y_i \in \{-1, 1\}$ that indicates whether the pair is known
to interact ($y_i = 1$) or not ($y_i = -1$). In an ideal noise-free situation, where the labels of pairs in
the training set are known with certainty, we thus have $y_i = 1$ if $e_i \in E$, and $y_i = -1$ otherwise.
However, in some situations we may also have noise or errors in the training set labels, in which
case we could only assume that pairs in $E$ tend to have a positive label, while pairs not in $E$ tend
to have a negative label.

The graph reconstruction problem can now be formally stated as follows: given the training
set $S$ and the set of vertex features $(\phi(v_1), \ldots, \phi(v_n))$, predict for all pairs not in $S$ whether they
interact (i.e., whether they are in $E$) or not. This formulation is illustrated in Figure 1.




Figure 1: We consider the problem of inferring missing edges in a graph (dotted edges) where a few 
edges are already known (solid edges). To carry out the inference, we use attributes available about 
individual vertices, such as vectors of expression levels across different experiments if vertices are 
genes. 

Stated this way, this problem is similar to classical pattern recognition problems, for which a
variety of efficient algorithms have been developed over the years. Before highlighting the slight dif- 
ference between the classical pattern recognition framework and ours, it is therefore worth recalling 
this classical pattern recognition paradigm and mentioning some algorithms adapted to solve it. 

2.2 Pattern recognition 

Pattern recognition, or binary supervised classification, is a well-studied problem in statistics and
machine learning [17, 7]. In its basic set-up, a training set $T = \{(u_1, t_1), \ldots, (u_N, t_N)\}$ of labeled
patterns is given, where $u_i \in \mathbb{R}^q$ is a vector and $t_i \in \{-1, 1\}$ is a binary label, for $i = 1, \ldots, N$. The
goal is then to infer a function $f : \mathbb{R}^q \to \{-1, 1\}$ that is able to predict the binary label $t$ of any
new pattern $u \in \mathbb{R}^q$ by $f(u)$.

Many methods have been proposed to infer the labeling function $f$ from the training set $T$,
including for example nearest neighbor classifiers, decision trees, logistic regression, artificial neural
networks or support vector machines (SVM). Although any of these methods can be used in what
follows, we will present experiments carried out with an SVM, which we briefly describe below,
mainly for three reasons: 

• It is now a widely-used algorithm, in particular in computational biology, with many public
implementations [34, 41].

• It provides a convenient framework to combine heterogeneous features about the vertices, such
as the sequence, expression and subcellular localization of proteins [29, 45, 24].

• Some methods developed so far for graph inference, which we describe below, are particularly
well adapted for a formalization in the context of SVM and kernel methods [3, 42].

Let us therefore briefly describe the SVM algorithm, and redirect the interested reader to various 
textbooks for more details [HH [12], [33| . Given the labeled training set T, an SVM estimates a linear 
function $h(u) = w^\top u$ for some vector $w \in \mathbb{R}^q$ (here $w^\top u$ represents the inner product between
$w$ and $u$), and then makes a label prediction for a new pattern $u$ that depends only on the sign
of $h(u)$: $f(u) = 1$ if $h(u) > 0$, $f(u) = -1$ otherwise. The vector $w$ is obtained as the solution
of an optimization problem that attempts to enforce a correct sign with large absolute values for
the values $h(u_i)$ on the training set, while controlling the Euclidean norm of $w$. The resulting
optimization problem is a quadratic program for which many specific and fast implementations 
have been proposed. 

An interesting property of SVM, particularly for the purpose of heterogeneous data integration,
is that the optimization problem only involves the training patterns $u_i$ through pairwise inner
products of the form $u_i^\top u_j$. Moreover, once the classifier is trained, the computation of $h(u)$ to
predict the label of a new point $u$ also involves only patterns through inner products of the form
$u^\top u_i$. Hence, rather than computing and storing each individual pattern as a vector $u$, we just
need to be able to compute inner products of the form $u^\top u'$ for any two patterns $u$ and $u'$ in order
to train an SVM and use it as a prediction engine. This inner product between patterns $u$ and $u'$
is a particular case of what is called a kernel and denoted $K(u,u') = u^\top u'$, to emphasize the fact
that it can be seen as a function that associates a number to any pair of patterns $(u,u')$, namely
their inner product. More generally a kernel is a function that computes the inner product between
two patterns $u$ and $u'$ after possibly mapping them to some vector space with inner product by a
mapping $\phi$, i.e., $K(u,u') = \phi(u)^\top \phi(u')$.
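
To make concrete the statement that prediction only needs inner products, here is a minimal Python sketch. It assumes that the trained weight vector can be written as $w = \sum_i \alpha_i t_i u_i$ (the usual dual form of an SVM, which the text does not spell out). The coefficients `alphas` are hypothetical values chosen for illustration, not the output of an actual training run, and all function names are ours.

```python
from typing import Callable, Sequence

Vector = Sequence[float]
Kernel = Callable[[Vector, Vector], float]


def linear_kernel(u: Vector, v: Vector) -> float:
    """K(u, u') = u^T u': the plain inner product."""
    return sum(a * b for a, b in zip(u, v))


def decision_value(u: Vector, patterns: Sequence[Vector], labels: Sequence[int],
                   alphas: Sequence[float], kernel: Kernel = linear_kernel) -> float:
    """h(u) from kernel evaluations only: sum_i alpha_i * t_i * K(u_i, u)."""
    return sum(a * t * kernel(u_i, u)
               for u_i, t, a in zip(patterns, labels, alphas))


def predict(u: Vector, patterns: Sequence[Vector], labels: Sequence[int],
            alphas: Sequence[float], kernel: Kernel = linear_kernel) -> int:
    """f(u) = 1 if h(u) > 0, and -1 otherwise."""
    return 1 if decision_value(u, patterns, labels, alphas, kernel) > 0 else -1


# Toy training set; the alphas are hypothetical, not learned.
patterns = [(1.0, 2.0), (-1.0, 0.5), (0.0, -1.0)]
labels = [1, -1, -1]
alphas = [0.5, 0.25, 0.25]

# Explicit weight vector w = sum_i alpha_i * t_i * u_i, for comparison only.
w = [sum(a * t * u_i[k] for u_i, t, a in zip(patterns, labels, alphas))
     for k in range(2)]

u_new = (2.0, 1.0)
h_kernel = decision_value(u_new, patterns, labels, alphas)
h_explicit = linear_kernel(w, u_new)
label_new = predict(u_new, patterns, labels, alphas)
```

Replacing `linear_kernel` by any other valid kernel changes the feature space without touching the prediction code, which is the property exploited in the rest of this section.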

Kernels are particularly relevant when the patterns are represented by vectors of large dimen- 
sions, whose inner products can nevertheless be computed efficiently. They are also powerful tools 
to integrate heterogeneous data. Suppose for example that each pattern $u$ can be represented as
two different vectors $u^{(1)}$ and $u^{(2)}$. This could be the case, for example, if one wanted to represent
a protein $u$ either by a vector of expression profile $u^{(1)}$ or by a vector of phylogenetic profile $u^{(2)}$.
Let now $K_1$ and $K_2$ be the two kernels corresponding to inner products for each representation,
namely, $K_1(u,u') = u^{(1)\top} u'^{(1)}$ and $K_2(u,u') = u^{(2)\top} u'^{(2)}$. If we now want to represent both types of
features into a single representation, a natural approach would be, e.g., to concatenate both vectors
$u^{(1)}$ and $u^{(2)}$ into a single vector, which we denote by $u^{(1)} \oplus u^{(2)}$ (also called the direct sum of
$u^{(1)}$ and $u^{(2)}$). In order to use this joint representation in an SVM, we need to be able to compute the
inner products between direct sums of two patterns to define a joint kernel $K_{joint}$. Interestingly,
some simple algebra shows that the resulting inner product is easily expressed as the sum of the
inner products of each representation, i.e.:

$$
\begin{aligned}
K_{joint}(u,u') &= \left(u^{(1)} \oplus u^{(2)}\right)^\top \left(u'^{(1)} \oplus u'^{(2)}\right) \\
&= u^{(1)\top} u'^{(1)} + u^{(2)\top} u'^{(2)} \\
&= K_1(u,u') + K_2(u,u') .
\end{aligned}
\tag{1}
$$

Consequently, the painstaking operation of concatenation between two vectors of potentially large 
dimension is advantageously replaced by simply doing the sum between two kernels. More generally, 
if k different representations are given, corresponding to k different kernels, then summing together 
the k kernels results in a joint kernel that integrates all different representations. The sum can also 
be replaced by any convex combination (linear combination with nonnegative weights) in order to 
weight differently the importance of different features [24].
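
The identity between concatenation and kernel summation can be checked numerically. The following Python sketch uses made-up expression and phylogenetic profiles for two proteins; the vectors, the weights and the helper names are illustrative assumptions only.

```python
from typing import Optional, Sequence

Vector = Sequence[float]


def inner(x: Vector, y: Vector) -> float:
    """Plain inner product x^T y."""
    return sum(a * b for a, b in zip(x, y))


def joint_kernel(u_reps: Sequence[Vector], v_reps: Sequence[Vector],
                 weights: Optional[Sequence[float]] = None) -> float:
    """Weighted sum of the linear kernels of each representation.

    With weights=None every representation gets weight 1, which is the
    kernel of the concatenated (direct sum) representation.
    """
    if weights is None:
        weights = [1.0] * len(u_reps)
    return sum(w * inner(a, b) for w, a, b in zip(weights, u_reps, v_reps))


# Hypothetical data: (expression profile, phylogenetic profile) per protein.
u = [(1.0, 0.5, -2.0), (1.0, 0.0, 1.0, 1.0)]
u_prime = [(0.5, 1.0, 1.0), (1.0, 1.0, 0.0, 1.0)]

# Explicit concatenation followed by one inner product.
k_concat = inner(u[0] + u[1], u_prime[0] + u_prime[1])

# Sum of the two kernels, without ever building the concatenated vectors.
k_sum = joint_kernel(u, u_prime)

# Nonnegative weights summing to one give a convex combination of kernels.
k_weighted = joint_kernel(u, u_prime, weights=(0.25, 0.75))
```

Here `k_concat` and `k_sum` coincide, while `k_weighted` gives more importance to the phylogenetic profile.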

2.3 Graph inference as a pattern recognition problem 

Let us now return to the graph reconstruction problem, as presented in Section 2.1. At first sight,
this problem is very similar to the general pattern recognition paradigm recalled in Section 2.2: given
pairs of vertices with positive and negative labels, infer a function $f$ to predict whether a new pair
has a positive label (i.e., is connected) or not. An important difference between the two problems, 
however, is that the features available in the graph reconstruction problem describe properties of 
individual vertices v, and not of pairs of vertices (v, v'). Thus, in order to apply pattern recognition 
techniques such as the SVM to solve the graph reconstruction problem, we can follow one of two 
possible avenues: 

1. Reformulate the graph reconstruction problem as a pattern recognition problem where bi- 
nary labels are attached to individual vertices (and not to pairs of vertices). Then pattern 
recognition methods can be used to infer the label of vertices based on their features. 

2. Keep the formulation as the problem of predicting the binary label of a pair of vertices, but 
find a way to represent as vectors (or as a kernel) pairs of vertices, while we initially only have 
features for individual vertices. 

Both directions are possible and have been investigated by different authors, leading to different 
algorithms. In Section 2.4 we present an instantiation of the first idea, which rephrases graph
reconstruction as a combination of simple pattern recognition problems at the level of individual
vertices. In Section 2.5 we present several instantiations of the second strategy, which amount to
defining a kernel for pairs of vertices from a kernel for individual vertices.

2.4 Graph inference with local models 

In this section we describe an approach that was proposed by [8] for the reconstruction of metabolic 
and PPI networks, and also successfully applied by [27] for regulatory network inference. The basic 
idea is very simple, and can be thought of as a "divide-and-conquer" strategy to infer new edges
in a graph. Each vertex of the graph is considered in turn as a seed vertex, independently from 
the others, and a "local" pattern recognition problem is solved to discriminate the vertices that are 
connected to this seed vertex against the vertices that are not connected to it. The local model 
can then be applied to predict new edges between the seed vertex and other vertices. This process
is then repeated with other vertices as seed to obtain edge prediction throughout the graph. More 
precisely, the "local model" approach can be described as follows: 

1. Take a seed vertex $v_{\mathrm{seed}}$ in $V$.

2. For each pair $(v_{\mathrm{seed}}, v')$ with label $y$ in the training set, associate the same label $y$ to the
individual vertex $v'$. This results in a set of labeled vertices $\{(v'_1, t_1), \ldots, (v'_{n(v_{\mathrm{seed}})}, t_{n(v_{\mathrm{seed}})})\}$,
where $n(v_{\mathrm{seed}})$ is the number of pairs starting with $v_{\mathrm{seed}}$ in the training set. We call this set
a local training set.

3. Train a pattern recognition algorithm on the local training set designed in step 2.

4. Predict the label of any vertex $v'$ that has no label, i.e., such that $(v_{\mathrm{seed}}, v')$ is not in the
training set.

5. If a vertex $v'$ has a positive predicted label, then predict that the pair $(v_{\mathrm{seed}}, v')$ has a positive
label (i.e., is an edge).

6. Repeat steps 1-5 for each vertex $v_{\mathrm{seed}}$ in $V$.

7. Combine the edges predicted at each iteration together, to obtain the final list of predicted 
edges. 

Figure 2: Illustration of one binary classification problem that is generated from the graph inference
problem of Figure 1 with the local model approach. Taking the shaded vertex as seed, other vertices
in the training set are labeled as +1 or -1 depending on whether they are known to be connected
or to be not connected to the shaded vertex. The goal is then to predict the label of vertices not
used during training. The process is then repeated by shading each vertex in turn.

This process is illustrated in Figure 2. Intuitively, such an approach can work if the features about
individual vertices provide useful information about whether or not they share a common neighbor.
For example, the approach was developed by [27] to reconstruct the gene regulatory network, i.e., to 
predict whether a transcription factor v regulates a gene v' , using a compendium of gene expression 
levels across a variety of experimental conditions as features. The paradigm seems particularly 
relevant in that case. Indeed, if two genes are regulated by the same TF, then they are likely to
behave similarly in terms of expression level; conversely, if a gene v' is known to be regulated by a 
TF v, and if the expression profile of another gene v" is similar to that of v', then one can predict 
that v" is likely to be regulated by v. The pattern recognition algorithm is precisely the tool that 
automates the task of predicting that v" has positive label, given that v' has itself a positive label
and that v' and v" share similar features. 
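
The seven steps of the local model approach can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the pattern recognition algorithm of step 3 is replaced here by a very simple nearest-centroid scorer instead of an SVM, a seed is skipped when its local training set lacks positive or negative examples, and the gene names and feature values are made up.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Optional, Set, Tuple

Vector = Tuple[float, ...]
Pair = Tuple[str, str]


def centroid(vectors: List[Vector]) -> Vector:
    return tuple(sum(col) / len(vectors) for col in zip(*vectors))


def sq_dist(x: Vector, y: Vector) -> float:
    return sum((a - b) ** 2 for a, b in zip(x, y))


def train_local_model(
        examples: List[Tuple[Vector, int]]) -> Optional[Callable[[Vector], float]]:
    """Stand-in classifier (NOT an SVM): positive score if closer to the positives."""
    pos = [x for x, y in examples if y == 1]
    neg = [x for x, y in examples if y == -1]
    if not pos or not neg:
        return None  # assumption: no model without both classes
    c_pos, c_neg = centroid(pos), centroid(neg)
    return lambda x: sq_dist(x, c_neg) - sq_dist(x, c_pos)


def local_model_inference(features: Dict[str, Vector],
                          training_pairs: Dict[Pair, int]) -> Set[Pair]:
    # Step 2: one local training set of labeled vertices per seed.
    local_sets: Dict[str, List[Tuple[Vector, int]]] = defaultdict(list)
    for (seed, other), label in training_pairs.items():
        local_sets[seed].append((features[other], label))

    predicted: Set[Pair] = set()
    for seed in features:                                     # steps 1 and 6
        model = train_local_model(local_sets.get(seed, []))   # step 3
        if model is None:
            continue
        for other, x in features.items():
            if (seed, other) in training_pairs:               # step 4: unlabeled only
                continue
            if model(x) > 0:                                   # step 5
                predicted.add((seed, other))
    return predicted                                           # step 7


# Hypothetical expression-like features for one TF and four genes.
features = {"tf": (0.0, -3.0), "g1": (1.0, 1.0), "g2": (0.9, 1.1),
            "g3": (-1.0, -1.0), "g4": (-0.9, -1.2)}
training_pairs = {("tf", "g1"): 1, ("tf", "g3"): -1}
predicted = local_model_inference(features, training_pairs)
```

In this toy example the gene `g2`, whose profile resembles that of the known target `g1`, is the only new target predicted for `tf`. Seeds without any labeled pair produce no prediction, which matches the limitation of local models discussed in this section.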

We note that this local model approach is particularly relevant for directed graphs, such as gene
regulatory networks. If our goal is to reconstruct an undirected graph, such as the PPI graph, then one
can follow exactly the same approach, except that (i) each undirected training pair {v, v'} should
be considered twice in step 2, namely as the directed pair (v, v') for the local model of v and as the
directed pair (v', v) for the local model of v', and (ii) in the prediction step for an undirected
pair {v, v'}, the prediction of the label of the directed pair (v, v') with the local model of v must be
combined with the prediction of the label of the directed pair (v', v) made by the local model of v'.
In [8], for example, in the prediction step the score of the directed pair (v, v') is averaged with the
score of the directed pair (v', v) to obtain a unique score for the undirected pair {v, v'}.
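
The averaging of the two directed scores can be written as a small helper. The sketch below is an assumption about data layout, not the implementation of [8]: directed scores are stored in a dictionary keyed by ordered pairs, and an undirected pair is scored only when both directions are available.

```python
from typing import Dict, FrozenSet, Tuple


def undirected_scores(
        directed: Dict[Tuple[str, str], float]) -> Dict[FrozenSet[str], float]:
    """Average the score of (v, v') and (v', v) into one score for {v, v'}."""
    combined: Dict[FrozenSet[str], float] = {}
    for (v, w), score in directed.items():
        if v == w or (w, v) not in directed:
            continue  # assumption: skip self-pairs and one-sided scores
        combined[frozenset((v, w))] = 0.5 * (score + directed[(w, v)])
    return combined


# Hypothetical scores produced by the local models of a, b and c.
directed = {("a", "b"): 0.75, ("b", "a"): 0.25,
            ("a", "c"): -0.5, ("c", "a"): 0.25}
scores = undirected_scores(directed)
```

The pair {a, b} ends up with a positive combined score and {a, c} with a negative one, even though the local model of `c` alone would have predicted an edge.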

In terms of computational complexity, it can be very beneficial to split a large pattern recognition
problem into several smaller problems. Indeed, the time and memory complexities of pattern
recognition algorithms such as SVM are roughly quadratic or worse in the number of training examples.
If a training set of $N$ pairs is split into $s$ local training sets of roughly $N/s$ patterns each,
then the total cost of running $s$ SVMs to estimate local models will therefore be of the order of
$s \times (N/s)^2 = N^2/s$. Hence if a local model is built for each vertex ($s = n$), one can expect a
speed-up of the algorithm of up to a factor of $n$ over an SVM that would work with $N$ pairs as training
patterns. Moreover, the local problems associated to different seed vertices being independent from
each other, one can trivially benefit from parallel computing architectures by training the different
local models on different processors.
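
As a quick sanity check of this estimate, the following Python lines evaluate both costs under the idealized assumptions of an exactly quadratic cost and perfectly balanced local training sets. The values of `N` and `s` are arbitrary illustrations, not figures from the experiments.

```python
def quadratic_cost(n_examples: int) -> int:
    """Idealized, unit-less training cost: exactly quadratic in the training set size."""
    return n_examples ** 2


def local_models_cost(n_pairs: int, n_models: int) -> int:
    """Cost of training n_models models on n_pairs / n_models examples each."""
    return n_models * quadratic_cost(n_pairs // n_models)


N = 100_000   # hypothetical number of labeled pairs
s = 1_000     # hypothetical number of local models (one per seed vertex)

global_cost = quadratic_cost(N)          # N^2
local_cost = local_models_cost(N, s)     # s * (N/s)^2 = N^2 / s
speedup = global_cost // local_cost      # equals s under these assumptions
```

In practice the local training sets are unbalanced and SVM solvers are not exactly quadratic, so the factor `s` should be read as an upper bound on the speed-up, in line with the "up to" of the text.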

On the other hand, an apparently important drawback of the approach is that the size of each 
local training set can become very small if, for example, a vertex has few or even no known neighbors. 
Inferring accurate predictive models from few training examples is known to be challenging in 
machine learning, and in the extreme case where a vertex has no known neighbor during training, 
then no new edge can ever be predicted. However, the experimental results, reported by [3, 27] and
in Section 3, show that one can obtain very competitive results with local models in spite of this
apparent difficulty. 

2.5 Graph inference with global models 

Splitting the training set of labeled pairs to make independent local models, as presented in Section
2.4, prevents any sharing of information between different local models. Using a slightly different
inference paradigm, one could argue that if a pair (v, v') is known to be connected, and if both v
is similar to v" and v' is similar to v'" in terms of features, then the pair (v", v'") is likely to be
connected as well. Such an induction principle is not possible with local models, since the pair (v, v')
is only considered by the local model of v, while (v", v'") is only considered by the local model of
v".

In order to implement this inference paradigm, we need to work directly with pairs of vertices
as patterns, and in particular to be able to represent any pair $(u, v) \in V \times V$ by a feature vector
which we denote $\psi(u,v)$. As we originally have only data to characterize each individual protein
$v$ by a vector $\phi(v)$, we therefore need to clarify how to derive a vector for a pair $\psi(u, v)$ from the
vectors $\phi(u)$ and $\phi(v)$ that characterize $u$ and $v$. This problem is illustrated in Figure 3.

Figure 3: With global models, we want to formulate the problem of edge prediction as a binary
classification problem over pairs of vertices. A pair can be connected (label +1) or not connected
(label -1). However the data available are attributes about each individual vertex (central picture).
Hence we need to define a representation for pairs of vertices, as illustrated on the right-hand picture,
in order to apply classical pattern recognition methods to discriminate between interacting and
non-interacting pairs in the graph shown in the left-hand picture.

As suggested in Section 2.2, kernels offer various useful tricks to design features, or equivalently
kernels, for pairs of vertices starting from features for individual vertices. Let us consider for example
a simple, although not very useful, trick to design a vector representation for a pair of vertices from
a vector representation of individual vertices. If each vertex $v$ is characterized by a vector of features
$\phi(v)$ of dimension $p$, we can choose to represent a pair of vertices $(u, v)$ by the concatenation of the
vectors $\phi(u)$ and $\phi(v)$ into a single vector $\psi_\oplus(u,v)$ of size $2p$. In other words, we could consider
their direct sum defined as follows:

$$
\psi_\oplus(u,v) = \phi(u) \oplus \phi(v) = \begin{pmatrix} \phi(u) \\ \phi(v) \end{pmatrix} .
\tag{2}
$$

If the dimension $p$ is large, one can avoid the burden of computing and storing large-dimensional
vectors by using the kernel trick. Indeed, let us denote by $K_V$ the kernel for vertices induced by
the vector representation $\phi$, namely, $K_V(v,v') = \phi(v)^\top \phi(v')$ for any pair of vertices $(v,v')$, and
let us assume that $K_V(v,v')$ can be easily computed. Then the following computation, similar to
(1), shows that the kernel $K_\oplus$ between two pairs of vertices $(a, b)$ and $(c, d)$ induced by the vector
representation $\psi_\oplus$ is easily computable as well:

$$
\begin{aligned}
K_\oplus((a, b), (c, d)) &= \psi_\oplus(a, b)^\top \psi_\oplus(c, d) \\
&= \begin{pmatrix} \phi(a) \\ \phi(b) \end{pmatrix}^\top \begin{pmatrix} \phi(c) \\ \phi(d) \end{pmatrix} \\
&= \phi(a)^\top \phi(c) + \phi(b)^\top \phi(d) \\
&= K_V(a,c) + K_V(b,d) .
\end{aligned}
\tag{3}
$$
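
Equation (3) can be verified on a small numerical example. In the Python sketch below, the feature vectors attached to the vertices `a`, `b`, `c` and `d` are arbitrary illustrative values, and the helper names are ours.

```python
from typing import Dict, Tuple

Vector = Tuple[float, ...]


def inner(x: Vector, y: Vector) -> float:
    return sum(p * q for p, q in zip(x, y))


def direct_sum(x: Vector, y: Vector) -> Vector:
    """Concatenation of two vectors of size p into one vector of size 2p."""
    return tuple(x) + tuple(y)


# Hypothetical vertex features phi(v) with p = 2.
phi: Dict[str, Vector] = {"a": (1.0, 2.0), "b": (0.0, -1.0),
                          "c": (3.0, 0.5), "d": (2.0, 2.0)}

# Left-hand side of (3): inner product of the explicit pair representations.
explicit = inner(direct_sum(phi["a"], phi["b"]), direct_sum(phi["c"], phi["d"]))

# Right-hand side of (3): sum of two vertex kernel values.
via_kernel = inner(phi["a"], phi["c"]) + inner(phi["b"], phi["d"])
```

Both quantities are equal, so the pair kernel never requires building the vectors of size 2p.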

Hence the kernel between pairs is here simply obtained by summing individual kernels, and an 
algorithm like an SVM could be trained on the original training set of labeled pairs, to predict the 
label of new pairs not in the training set. Although attractive at first sight, this formulation has an 
important limitation. Training an SVM (or any linear classifier) means that one estimates a linear 
function in the space of direct sums, i.e., a function for pairs of the form: $h(u,v) = w^\top \psi_\oplus(u,v)$.
The vector $w$ (of size $2p$) can be decomposed as a concatenation of two parts $w_1$ and $w_2$ of size $p$,
i.e., $w = w_1 \oplus w_2$. We can then rewrite the linear function as:

$$
h(u, v) = (w_1 \oplus w_2)^\top (\phi(u) \oplus \phi(v)) = w_1^\top \phi(u) + w_2^\top \phi(v) .
$$

Hence any linear classifier $h(u, v)$ in the space defined by the direct sum representation decomposes
as a sum of two independent functions:

$$
h(u, v) = h_1(u) + h_2(v) ,
$$

with $h_i(v) = w_i^\top \phi(v)$ for $i = 1,2$. This is in general an unfortunate property since it implies, for
example, that whatever the target vertex $u$, if we sort the candidate vertices $v$ that can interact
with $u$ according to the classifier (i.e., if we rank $v$ according to the value of $h(u,v)$), then the
order will not depend on $u$. In other words, each vertex $v$ would be associated to a particular score
$h_2(v)$ that could be thought of as its general propensity to interact, and the prediction of vertices
connected to a particular vertex $u$ would only depend on the scores of the vertices tested, not on $u$
itself. This clearly limits the scope of the classification rules that linear classifiers can produce with
the direct sum representations, which suggests that this approach should not be used in general.
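
This limitation is easy to observe numerically. In the following Python sketch, the weight vectors `w1` and `w2` and the vertex features are arbitrary illustrative values, not learned parameters. The candidates are ranked for two very different target vertices, and the two rankings coincide.

```python
from typing import Dict, List, Tuple

Vector = Tuple[float, ...]


def inner(x: Vector, y: Vector) -> float:
    return sum(p * q for p, q in zip(x, y))


# Hypothetical linear classifier w = w1 (+) w2 in the direct sum space.
w1: Vector = (0.5, -1.0)
w2: Vector = (2.0, 1.0)

# Hypothetical vertex features: two targets (u1, u2) and three candidates.
phi: Dict[str, Vector] = {"u1": (1.0, 0.0), "u2": (-3.0, 4.0),
                          "x": (1.0, 1.0), "y": (0.0, 2.5), "z": (-1.0, 0.0)}


def h(u: str, v: str) -> float:
    """h(u, v) = w1^T phi(u) + w2^T phi(v) = h1(u) + h2(v)."""
    return inner(w1, phi[u]) + inner(w2, phi[v])


def ranking(u: str, candidates: List[str]) -> List[str]:
    """Candidates sorted by decreasing score h(u, v) for a fixed target u."""
    return sorted(candidates, key=lambda v: h(u, v), reverse=True)


candidates = ["x", "y", "z"]
rank_u1 = ranking("u1", candidates)
rank_u2 = ranking("u2", candidates)
```

Changing the target only shifts all scores by the constant $h_1(u)$, so `rank_u1` and `rank_u2` are identical whatever the features of the targets.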

A generally better alternative to the direct sum $\psi_\oplus(u,v)$ is to represent a pair of vertices $(u,v)$
by their direct product:

$$
\psi_\otimes(u,v) = \phi(u) \otimes \phi(v) .
\tag{4}
$$

If $\phi(u)$ and $\phi(v)$ each has a dimension $p$, then the direct product $\psi_\otimes(u,v)$ is by definition a vector
of dimension $p^2$ whose entries are all possible products between a feature of $\phi(u)$ and a feature of
$\phi(v)$. An interesting property of the direct product is that it encodes features that are characteristic
of the pair $(u, v)$, and not merely of $u$ and $v$ taken separately. For example, let us assume that $\phi(u)$
and $\phi(v)$ contain binary features that indicate the presence or absence of particular features in $u$
and $v$. Then, because the product of binary features is equivalent to a logical AND, the vector
$\psi_\otimes(u,v)$ contains binary features that indicate the joint occurrence of particular pairs of features
in $u$ and $v$. As a result, contrary to the direct sum representation $\psi_\oplus(u,v)$, linear classifiers in the
space defined by $\psi_\otimes(u,v)$ could predict that $a$ is more likely to interact with $u$ than $b$, while $b$ is
more likely to interact with $v$ than $a$, for two different target vertices $u$ and $v$.

The price to pay in order to obtain this large flexibility is that the dimension of the
representation, namely $p^2$, can easily get very large. Typically, if an individual gene is characterized
by a vector of dimension 1,000 to encode expression data, phylogenetic profiles and/or subcellular
localization information, then the direct product representation has one million dimensions. Such
large dimensions may cause serious problems in terms of computation time and memory storage
for practical applications. Fortunately, if one works with kernel methods like SVM, a classical trick
makes it possible to compute efficiently the inner product between two tensor product vectors from
the inner products between individual vectors:

$$\begin{aligned}
K_\otimes((a, b), (c, d)) &= \psi_\otimes(a, b)^\top \psi_\otimes(c, d) \\
&= (\phi(a) \otimes \phi(b))^\top (\phi(c) \otimes \phi(d)) \\
&= \phi(a)^\top \phi(c) \times \phi(b)^\top \phi(d) \\
&= K_V(a, c) \times K_V(b, d) ,
\end{aligned} \tag{5}$$

where the third line is a classical result easily demonstrated by expanding the inner product
between tensor product vectors. Hence one obtains the kernel between two pairs of vertices by just
multiplying together the kernel values involving each vertex of the first pair and the corresponding
vertex of the second pair.
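
As an illustration of this trick, the sketch below compares the explicit inner product in the $p^2$-dimensional space with the product of two inner products in the original space. It assumes a linear kernel on hypothetical 3-dimensional integer feature vectors, chosen only so that the arithmetic is exact.

```python
def dot(x, y):
    """Inner product of two equal-length vectors (the kernel K_V here)."""
    return sum(a * b for a, b in zip(x, y))

def kron(x, y):
    """Direct (tensor) product: all products x_i * y_j, of dimension len(x) * len(y)."""
    return [xi * yj for xi in x for yj in y]

def product_kernel_explicit(a, b, c, d):
    """Inner product computed in the explicit p^2-dimensional space."""
    return dot(kron(a, b), kron(c, d))

def product_kernel_trick(a, b, c, d):
    """Same value obtained from kernels between individual vertices."""
    return dot(a, c) * dot(b, d)

# Hypothetical feature vectors for four vertices.
a, b, c, d = [1, 0, 2], [0, 3, 1], [2, 1, 1], [1, 1, 0]
explicit = product_kernel_explicit(a, b, c, d)
trick = product_kernel_trick(a, b, c, d)
```

Both computations give the same value, but the second one never builds a vector of dimension $p^2$.
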

The direct sum ([2]) and product representations correspond to representations of ordered
pairs, which usually map a pair (u, v) and its reverse (v, u) to different vectors. For example, the
concatenation of two vectors $\phi(u)$ and $\phi(v)$ is generally different from the concatenation of $\phi(v)$
and $\phi(u)$, i.e., $\psi_\oplus(u,v) \neq \psi_\oplus(v,u)$, except when $\phi(u) = \phi(v)$. Hence these representations are well
adapted to the prediction of edges in directed graphs, where an ordered pair (u, v) can represent
an edge from u to v and the pair (v, u) then represents the different edge from v to u. When the
graph of interest is not directed, then it can be advantageous to also represent an undirected pair
{u, v}. An extension of the tensor product representation was for example proposed by [3] with the
following tensor product pairwise kernel (TPPK) representation for undirected pairs:

$$\psi_{TPPK}(\{u, v\}) = \psi_\otimes(u, v) + \psi_\otimes(v, u) . \tag{6}$$

This representation is the symmetrized version of the direct product representation, which makes it
invariant to a permutation in the order of the two vertices in a pair. The corresponding kernel is
easily derived as follows:

$$\begin{aligned}
K_{TPPK}(\{a, b\}, \{c, d\}) &= \psi_{TPPK}(\{a, b\})^\top \psi_{TPPK}(\{c, d\}) \\
&= (\psi_\otimes(a, b) + \psi_\otimes(b, a))^\top (\psi_\otimes(c, d) + \psi_\otimes(d, c)) \\
&= \psi_\otimes(a, b)^\top \psi_\otimes(c, d) + \psi_\otimes(a, b)^\top \psi_\otimes(d, c) \\
&\quad + \psi_\otimes(b, a)^\top \psi_\otimes(c, d) + \psi_\otimes(b, a)^\top \psi_\otimes(d, c) \\
&= 2 \{ K_V(a, c) K_V(b, d) + K_V(a, d) K_V(b, c) \} .
\end{aligned} \tag{7}$$

Once again we see that the inner product in the space of the TPPK representation is easily computed
from the values of kernels between individual vertices, without the need to compute explicitly the
$p^2$-dimensional TPPK vector. This approach is therefore, again, particularly well suited to be used
in combination with an SVM or any other kernel method.
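
The same kind of numerical check applies to the TPPK kernel. This sketch again assumes a linear kernel on hypothetical integer feature vectors; it builds the symmetrized vector explicitly, compares it with the closed form, and verifies that swapping the two vertices of a pair leaves the kernel unchanged.

```python
def dot(x, y):
    """Inner product of two equal-length vectors (the kernel K_V here)."""
    return sum(a * b for a, b in zip(x, y))

def kron(x, y):
    """Direct (tensor) product of two vectors."""
    return [xi * yj for xi in x for yj in y]

def tppk_features(u, v):
    """Symmetrized direct product: psi(u, v) + psi(v, u)."""
    return [s + t for s, t in zip(kron(u, v), kron(v, u))]

def tppk_explicit(a, b, c, d):
    """TPPK kernel computed from the explicit p^2-dimensional vectors."""
    return dot(tppk_features(a, b), tppk_features(c, d))

def tppk_trick(a, b, c, d):
    """TPPK kernel computed from kernels between individual vertices."""
    return 2 * (dot(a, c) * dot(b, d) + dot(a, d) * dot(b, c))

# Hypothetical feature vectors for four vertices.
a, b, c, d = [1, 0, 2], [0, 3, 1], [2, 1, 1], [1, 1, 0]
explicit = tppk_explicit(a, b, c, d)
trick = tppk_trick(a, b, c, d)
swapped = tppk_trick(b, a, c, d)  # same undirected pair {a, b}
```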

An alternative and perhaps more intuitive justification for the TPPK kernel ([7]) is in terms of
similarity or distance between pairs induced by this formulation. Indeed, when a kernel $K_V$ is such
that $K_V(v,v) = 1$ for all v, which equivalently means that all vectors $\phi(v)$ are normalized to unit
norm, then the value of the kernel $K_V(u,v)$ is a good indicator of the "similarity" between u and v.
In particular we easily show in that case that:

$$K_V(u, v) = \phi(u)^\top \phi(v) = 1 - \frac{\|\phi(u) - \phi(v)\|^2}{2} ,$$

which shows that $K_V(u,v)$ is "large" when $\phi(u)$ and $\phi(v)$ are close to each other, i.e., when u
and v are considered "similar". An interesting point of view to define a kernel over pairs in this
context is then to express it in terms of similarity: when do we want to say that an unordered pair
{a, b} is similar to a pair {c, d}, given the similarities between individual vertices? One attractive
formulation is to consider them similar if either (i) a is similar to c and b is similar to d, or (ii)
a is similar to d and b is similar to c. Translating these notions into equations, the TPPK kernel
formulation ([7]) can be thought of as an implementation of this principle [3].
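
For completeness, the identity relating $K_V(u,v)$ to the distance between $\phi(u)$ and $\phi(v)$, used in the argument above, follows from expanding the squared norm under the unit-norm assumption $K_V(u,u) = K_V(v,v) = 1$. The step is standard and is written out here only as a reading aid:

```latex
\begin{align*}
\|\phi(u) - \phi(v)\|^2
  &= \phi(u)^\top \phi(u) - 2\,\phi(u)^\top \phi(v) + \phi(v)^\top \phi(v) \\
  &= K_V(u,u) - 2\,K_V(u,v) + K_V(v,v) \\
  &= 2 - 2\,K_V(u,v),
\end{align*}
% Rearranging gives K_V(u,v) = 1 - \|\phi(u) - \phi(v)\|^2 / 2.
```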

At this point, it is worth mentioning that although the tensor product ([4]) for directed pairs, and
its extension ([6]) for undirected pairs, can be considered as "natural" default choices to represent
pairs of vertices as vectors from representations of individual vertices, they are by no means the
only possible choices. As an example, let us briefly mention the construction of [42] who propose to
represent an undirected pair as follows:

$$\psi_{MLPK}(\{u, v\}) = (\phi(u) - \phi(v))^{\otimes 2} = (\phi(u) - \phi(v)) \otimes (\phi(u) - \phi(v)) . \tag{8}$$

The name MLPK stands for metric learning pairwise kernel. Indeed, [42] shows that training a linear
classifier in the representation defined by the MLPK vector ([8]) is equivalent, in some situations, to
estimating a new metric in the space of individual vertices $\phi(v)$, and classifying a pair as positive
or negative depending on whether the distance between $\phi(u)$ and $\phi(v)$ (with respect to
the new metric) is below a threshold. Hence this formulation can be particularly relevant
in cases where connected vertices seem to be "similar", in which case a linear classifier coupled
with the MLPK representation can learn by itself the optimal notion of "similarity" that should
be used in a supervised framework. For example, if a series of expression values for genes across
a range of experiments is available, one could argue that proteins coded by genes with "similar"
expression profiles are more likely to interact than others, and therefore that a natural way to
predict interaction would be to measure a "distance" between all pairs of expression profiles and
predict an interaction whenever this distance falls below a threshold. The question of how to choose
a "distance" between expression profiles is then central, and instead of choosing a priori a distance
such as the Euclidean norm, one could typically let an SVM train a classifier with the MLPK
representation to mimic the process of choosing an optimal way to measure distances in order to
predict interactions.

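An interesting property of the MLPK representation ([8]) is that, as for the tensor product and
TPPK representation, it leads to an inner product that can easily be computed without explicitly
computing the $p^2$-dimensional vector $\psi_{MLPK}(\{a, b\})$:

$$\begin{aligned}
K_{MLPK}(\{a, b\}, \{c, d\}) &= \psi_{MLPK}(\{a, b\})^\top \psi_{MLPK}(\{c, d\}) \\
&= \left( (\phi(a) - \phi(b))^{\otimes 2} \right)^\top (\phi(c) - \phi(d))^{\otimes 2} \\
&= \left[ (\phi(a) - \phi(b))^\top (\phi(c) - \phi(d)) \right]^2 \\
&= \left[ \phi(a)^\top \phi(c) - \phi(a)^\top \phi(d) - \phi(b)^\top \phi(c) + \phi(b)^\top \phi(d) \right]^2 \\
&= \left[ K_V(a, c) - K_V(a, d) - K_V(b, c) + K_V(b, d) \right]^2 .
\end{aligned} \tag{9}$$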
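
The MLPK identity can be verified in the same way. The sketch below assumes a linear kernel on hypothetical integer feature vectors and compares the explicit computation with the closed form in terms of the four vertex-level kernel values.

```python
def dot(x, y):
    """Inner product of two equal-length vectors (the kernel K_V here)."""
    return sum(a * b for a, b in zip(x, y))

def kron(x, y):
    """Direct (tensor) product of two vectors."""
    return [xi * yj for xi in x for yj in y]

def mlpk_features(u, v):
    """(phi(u) - phi(v)) tensor (phi(u) - phi(v))."""
    diff = [s - t for s, t in zip(u, v)]
    return kron(diff, diff)

def mlpk_explicit(a, b, c, d):
    """MLPK kernel computed from the explicit p^2-dimensional vectors."""
    return dot(mlpk_features(a, b), mlpk_features(c, d))

def mlpk_trick(a, b, c, d):
    """MLPK kernel computed from kernels between individual vertices."""
    return (dot(a, c) - dot(a, d) - dot(b, c) + dot(b, d)) ** 2

# Hypothetical feature vectors for four vertices.
a, b, c, d = [1, 0, 2], [0, 3, 1], [2, 1, 1], [1, 1, 0]
explicit = mlpk_explicit(a, b, c, d)
trick = mlpk_trick(a, b, c, d)
swapped = mlpk_trick(b, a, c, d)  # same undirected pair {a, b}
```

Because the difference vector only changes sign when the two vertices are swapped, the squared form is invariant to the order within each pair.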



2.6 Remarks 

We have shown how the general problem of graph reconstruction can be formulated as a pattern
recognition problem (Sections 2.1-2.3), and described several instances of this idea: either by training
a multitude of local models to learn the local structure of the graph around each node (Section 2.4),
which boils down to a series of pattern recognition problems over vertices, or by training a single
global model to predict whether any given pair of vertices interacts or not, which requires the
definition of a vector representation (or equivalently of a kernel) for pairs of vertices (Section 2.5).
Our presentation has been fairly general, in order to highlight the general ideas behind the approach
and the main choices one has to make in order to implement it. Now, we discuss several important
questions that one must also address to implement the idea on any particular problem.

• Directed or undirected graph. As pointed out in the introduction, some biological networks are
better represented by undirected graphs (e.g., the PPI network) while others are more naturally
viewed as directed graphs (e.g., a gene regulatory network). In the course of our presentation
we have shown that some methods are specifically adapted to one case or the other. For
example, the MLPK and TPPK kernel formulations to learn global models (equations [7] and [9])
are specifically tailored to solve problems over undirected pairs, i.e., to reconstruct undirected
graphs. On the other hand, the local models (Section 2.4) or the global models with the
direct product kernel ([5]) are naturally suited to infer interactions between directed pairs,
i.e., to reconstruct directed graphs. However, one can also use them to reconstruct undirected
graphs by simply counting each undirected pair {u, v} as two directed pairs (u, v) and (v, u). In
the training step, this means that we can replace each labeled undirected pair (i.e., undirected
edge known to be present or absent) by two directed pairs with the same label. In the
prediction step, this means that one would get a prediction for the pair (u, v) and another
prediction for the pair (v, u), which have no reason to be consistent with each other when
predicting whether the undirected pair {u, v} is connected or not. In order to reconcile both
predictions, one can typically take the average of the prediction scores of the classifiers for
both directed pairs in order to obtain a unique prediction score for the undirected pair.

• Different types of edges. Some biological networks are better represented by graphs with edges 
having additional attributes, such as a label among a finite set of possible labels. For example, 
to describe a gene regulatory network it is common to consider two types of regulations (edges),
namely activation or inhibition. In terms of prediction, this means that we not only need to 
predict whether two vertices are connected or not, but also by what type of edges they are 
connected. A simple strategy to extend the pattern recognition paradigm to this context is 
to see the problem not as a binary classification problem, but more generally as a multi-class 
classification problem. In the previous example, one should for example assign each pair 
(u, v) to one of the three classes (no regulation, activation, inhibition). Luckily the extension 
of pattern recognition algorithms to the multi-class setting is a well-studied field in machine 
learning for which many solutions exist (TTJ E]- For example, a popular approach to solve 
a classification problem with k classes is to replace it by k binary classification problems,
where each binary problem discriminates between the data in one of the k classes and the rest of
the data. Once the k classifiers are trained, they can be applied to score each new candidate 
point, and the class corresponding to the classifier that outputs the largest score is predicted. 
Other approaches also exist besides this scheme, known as the one-versus-all strategy. Overall 
they show that the pattern recognition formulation can easily accommodate the prediction of 
different edge types just by using a multi-class classification algorithm. 

• Negative training pairs. While most databases store information about the presence of edges
and can be used to generate positive training examples, few if any negative interactions are
usually reported. This is an important problem since, as we formulated it in Section 2.2, the
typical pattern recognition formalism requires positive as well as negative training examples.
In order to overcome this obstacle several strategies can be pursued. A first idea would be to
refrain from focusing exclusively on pattern recognition algorithms which are not adapted to
the lack of negative examples, and use instead algorithms specifically designed to handle only
positive examples. For example, many methods in statistics for density estimation or outlier
detection are designed to estimate a small region that contains all or most of the positive
training points. If such a region of "positive examples" is found around pairs known to be
connected, then a new pair of vertices can be predicted to be connected if it also lies in the
region. An algorithm like the one-class SVM [32] is typically adapted to this setting, and
can accommodate all the kernel formulations we presented so far. A second idea would be to
keep using algorithms for binary classification, and generate negative examples. Perhaps the
simplest way to do this is to randomly sample pairs of vertices, among the ones not known to
be connected, and declare that they are negative examples. As the graph is usually supposed
to be sparse, most pairs of vertices randomly picked by this process indeed do not interact,
and are correctly labeled as negative. On the other hand, the few pairs that would be wrongly
labeled as negative with this procedure, namely the pairs that interact although we do not
know it yet, are precisely the ones we are interested in finding. There is then a danger that
by labeling them as negative and training a classifier based on this label, we make them more
difficult to find. To overcome this particular issue of generating false negative examples
in the training set, one may again consider two ideas. First, try to reduce the quantity of
wrongly labeled negative training pairs by, e.g., using additional sources of information to
increase the likelihood that they do not interact. For example, if one wants to choose pairs
of proteins that are very unlikely to interact, one may restrict oneself to proteins known to be
located in different subcellular localizations, which in theory prevents any possibility of physical
interaction. While this may increase the size of the training set, there is also a danger of
biasing the training set towards "easy" negative examples [4]. The second idea is to accept the
risk of generating false negative training examples, but then to be careful at least that the
predictive models never predict the label of a pair that was used during their training. This
can be achieved, for example, by splitting the set of candidate negative pairs (i.e., those not
known to interact) into k disjoint subsets, training a classifier using k - 1 of these subsets as
negative training examples, and using the resulting classifier to predict the labels of pairs in
the subset that was left out. Repeating this procedure k times leads to the possibility of
predicting the labels for the k subsets, without ever predicting the label of a negative example
that was used during training. This strategy was for example used in [27].

• Presence or absence of errors in the training data. Besides the lack of known negative
examples, one may also be confronted with possible errors in the positive training examples,
i.e., false positives in the training set. Indeed, many databases of biological networks
contain both certain interactions, and interactions believed to be true based on various empirical
evidence but that could be wrong. This is particularly true, for example, for PPI networks
when physical interactions have been observed with high-throughput technologies such as the 
yeast two-hybrid system, which is known to be prone to many false positive detections. In 
that case, we should not only be careful when using the data as positive training examples, 
but we may even consider the possibility of using the predictive algorithms to remove wrong 
positive annotations from the training set. Regarding the problem of training models with 
false positive training examples, this may not be a major obstacle since one of the strengths 
of statistical pattern recognition methods is precisely to accept "noise" or errors in the data. 
On the other hand, if one wants to further use the models to correct the training data, then 
a specific procedure could be imagined, for example similar to the procedure described in the 
previous paragraph to predict the label of false negative examples. 
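
The k-subset procedure described in the "Negative training pairs" item above can be sketched as follows. This is only an illustration under stated assumptions: the pair vectors are hypothetical 2-dimensional toy data, and the nearest-centroid scorer `train_centroid_classifier` is a stand-in for whatever binary classifier (e.g., an SVM) is actually used.

```python
import random

def split_into_folds(items, k, seed=0):
    """Shuffle the items and split them into k disjoint subsets."""
    items = list(items)
    random.Random(seed).shuffle(items)
    return [items[i::k] for i in range(k)]

def centroid(vectors):
    """Coordinate-wise mean of a list of vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def train_centroid_classifier(pos, neg):
    """Toy stand-in for an SVM: score(x) = <x, mean(pos) - mean(neg)>."""
    w = [p - n for p, n in zip(centroid(pos), centroid(neg))]
    return lambda x: sum(wi * xi for wi, xi in zip(w, x))

def score_unlabeled(positives, unlabeled, k=3, seed=0):
    """Score every unlabeled pair with a model that never used it as a negative."""
    folds = split_into_folds(sorted(unlabeled), k, seed)
    scores = {}
    for i, held_out in enumerate(folds):
        # Negatives come from the k - 1 other subsets only.
        train_neg = [unlabeled[name]
                     for j, fold in enumerate(folds) if j != i
                     for name in fold]
        model = train_centroid_classifier(positives, train_neg)
        for name in held_out:
            scores[name] = model(unlabeled[name])
    return scores

# Hypothetical pair representations: known edges, and pairs not known to interact.
positives = [[1.0, 1.0], [0.9, 1.1], [1.1, 0.8]]
unlabeled = {
    "n0": [-1.0, -1.0], "n1": [-0.8, -1.1], "n2": [-1.2, -0.9],
    "n3": [-1.0, -0.7], "n4": [-0.9, -1.2],
    "hidden": [1.0, 0.9],  # an unknown true interaction among the candidates
}
scores = score_unlabeled(positives, unlabeled, k=3)
best = max(scores, key=scores.get)
```

In this toy setting the pair named `hidden` still receives the highest score, because the model that scores it was trained without it among the negatives.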

3 Examples 

Recently, the different approaches surveyed in Section 2 have been extensively tested and compared
to other approaches in several publications. In this section, we review the main findings of these
publications, focusing on our three running examples of biological networks.

3.1 Reconstruction of a metabolic network 

The reconstruction of metabolic networks was among the first applications that motivated the
line of research surveyed in this chapter [45, 43, 46, 8]. We consider here the problem of inferring
the metabolic gene network of the yeast S. cerevisiae with the enzymes represented as vertices, and
an edge between two enzymes when the two enzymes catalyse successive reactions. The dataset,
proposed by [46], consists of 668 vertices (enzymes) and 2782 edges between them which were
extracted from the KEGG database of metabolic pathways [22]. In order to predict edges in these
networks, [8] used various genomic datasets and compared different inference methods. Following
[46], the data used to characterize enzymes comprise 157 expression measurements made under different
experimental conditions [HIES], a vector of 23 bits representing the localization of the enzymes 
(found or not found) in 23 locations in the cell determined experimentally p2], and the phylogenetic 
profiles of the enzymes as vectors of 145 bits denoting the presence or absence of the enzyme in 145 
fully sequenced genomes [22]. Each type of data was processed and transformed into a kernel as
described in [36, 23], and all matrices were summed together to produce a single kernel integrating
heterogeneous data. 

In a common 5-fold cross-validation setting, [8] compared different methods including local
models (Section 2.4), the TPPK and MLPK kernels (Section 2.5), as well as several other methods:
a direct de novo approach which only infers edges between similar vertices, an approach based
on kernel canonical correlation analysis (KCCA) [45], and a matrix completion algorithm based on
an em procedure [HH [23]. On each fold of the cross-validation procedure, each method uses the 
training set to learn a model and makes predictions on pairs in the test set. All methods associate 
a score to all pairs in the test set, hence by thresholding this score at different levels they can 
predict more or fewer edges. Results were assessed in terms of average ROC curve (which plots the
percentage of true positives as a function of the percentage of false positives, when the threshold
level is varied) and average precision/recall curve (which plots the percentage of true positives
among positive predictions, as a function of the percentage of true positives among all positives). In
practical applications, the latter criterion is a better indicator of the relevance of a method than the
former one. Indeed, as biological networks are usually sparse, the number of negatives far exceeds 
the number of positives, and only large precision (over a recall as large as possible) can be tolerated 
if further experimental validations are expected. 
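
To make the two criteria concrete, here is a minimal sketch that computes the points of both curves from a list of scores and binary labels. The scores and labels are hypothetical, and the function assumes there are no tied scores; it is not the evaluation code used in [8].

```python
def roc_and_pr_points(scores, labels):
    """Lower the threshold past one pair at a time and record both curves.

    Returns (roc, pr), where roc holds (false positive rate, true positive
    rate) points and pr holds (recall, precision) points.
    Assumes no tied scores and at least one positive and one negative.
    """
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    tp = fp = 0
    roc, pr = [], []
    for i in order:
        if labels[i]:
            tp += 1
        else:
            fp += 1
        roc.append((fp / n_neg, tp / n_pos))
        pr.append((tp / n_pos, tp / (tp + fp)))
    return roc, pr

# Hypothetical sparse setting: 2 true edges among 10 candidate pairs.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]
labels = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
roc, pr = roc_and_pr_points(scores, labels)
```

At the third threshold both true edges are recovered with a false positive rate of only 12.5%, yet one prediction in three is already wrong. With many more negatives than positives this gap widens, which is why the precision/recall curve is the more demanding criterion.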

Figure [4] shows the performance of the different methods on this benchmark. A very clear
advantage for the local model can be seen. In particular it is the only method tested that can produce
predictions at more than 80% precision. There is no clear winner among the other supervised
methods, while the direct approach, which is the only de novo method in this comparison, is clearly
below the supervised methods.




Figure 4: Performance of different methods for the reconstruction of metabolic networks (from [8]):
ROC (left) and precision/recall (right) curves.

3.2 Reconstruction of a PPI network



As a second application, we consider the problem of inferring missing edges in the PPI network of
the yeast S. cerevisiae. The gold standard PPI graph used to perform a cross-validation experiment
is a set of high-confidence interactions supported by several experiments provided by [44] and also
used in [23]. After removal of proteins without interactions we end up with a graph involving
2438 interactions (edges) among 984 proteins (vertices). In order to reconstruct missing edges the
genomic data used are the same as those used for the reconstruction of the metabolic network in
Section 3.1, namely gene expression, protein localization and phylogenetic profiles, together with a
set of yeast two-hybrid data obtained from [19] and [39]. The latter was converted into a positive
definite kernel using a diffusion kernel, as explained in [23]. Again, all datasets were combined into
a unique kernel by adding together the four individual kernels.

Figure [5] shows the performance of the different methods, using the same experimental protocol
as the one used for the experiment with metabolic network reconstruction in Section 3.1. Again the
best method is the local model, although it outperforms the other methods by a smaller margin
than for the reconstruction of the metabolic network (Figure [4]). Again the ROC curve of the de
novo direct method is clearly below the curves of the supervised methods, although this time it
leads to large precision at low recall. This means that a few interacting pairs can very easily be
detected because they have very similar genomic data.




Figure 5: Performance of different methods for the reconstruction of the PPI network (from [8]):
ROC (left) and precision/recall (right) curves. 



3.3 Reconstruction of gene regulatory networks 

Finally, we report the results of an experiment conducted for the inference of a gene regulatory
network by [27]. In that case the edges between transcription factors and the genes they regulate
are directed, therefore only the local model of Section 2.4 is tested. It is compared to a panel
of other state-of-the-art methods dedicated to the inference of gene regulatory networks from a
compendium of gene expression data, using a benchmark proposed by [14]. More precisely, the goal
of this experiment is to predict the regulatory network of the bacterium E. coli from a compendium
of 445 microarray expression profiles for 4345 genes. The microarrays were collected under different
experimental conditions such as pH changes, growth phases, antibiotics, heat shock, different media,
varying oxygen concentrations and numerous genetic perturbations. The gold standard graph used
to assess the performance of different methods by cross-validation consists of 3293 experimentally
confirmed regulations between 154 TFs and 1211 genes, extracted from the RegulonDB database.

In [14] this benchmark was used to compare different algorithms, including Bayesian networks
[15], ARACNe [26], and the context likelihood of relatedness (CLR) algorithm [14], a new method
that extends the relevance networks class of algorithms [9]. They observed that CLR outperformed
all other methods in prediction accuracy, and experimentally validated some predictions. CLR can
therefore be considered as state-of-the-art among methods that use compendia of gene expression
data for large-scale inference of regulatory networks. However, all the methods compared in [14]
are de novo, and the goal of [27] was to compare the supervised local approach to the best de novo
method on this benchmark, namely the CLR algorithm. Using a 3-fold cross-validation procedure
(see details in [27]), they obtained the curves in Figure [6]. We can observe that the local supervised
approach (called SIRENE for Supervised Inference of REgulatory NEtwork) strongly outperforms
the CLR method on this benchmark. The recall obtained by SIRENE, i.e., the proportion of known
regulations that are correctly predicted, is several times larger than the recall of CLR at all levels of
precision. More precisely, Table [1] compares the recalls of SIRENE, CLR and several other methods
at 80% and 60% precision. The other methods reported are relevance networks [9], ARACNe [26],
and a Bayesian network [15] implemented by [14].

Figure 6: Comparison of the CLR method and the local pattern recognition approach (called
SIRENE) on the reconstruction of a regulatory network: ROC (left) and precision/recall (right)
curves. The curve SIRENE-Bias corresponds to the performance of SIRENE with a cross-validation
procedure which does not take into account the organization of genes in operons, thus introducing
an artificial positive bias in the result.

Table 1: Recall of different gene regulation prediction algorithms at different levels of precision
(60% and 80%) (from [27]).

Method               Recall at 60%   Recall at 80%
SIRENE               44.5%           17.6%
CLR                  7.5%            5.5%
Relevance networks   4.7%            3.3%
ARACNe               1%              0%
Bayesian network     1%              0%

This experiment also highlights the special care that must be taken when performing a
cross-validation procedure, in particular to make sure that no artificial bias is introduced. The curve called
SIRENE-Bias in Figure [6] corresponds to a normal k-fold cross-validation procedure, where the set of
genes is randomly split into k folds and each fold is used in turn as test set. In the case of regulation
in bacteria like E. coli, however, it is known that TFs can regulate groups of genes clustered together
on the genome, called operons. Genes in the same operon are transcribed in the same messenger
RNA, and therefore have very similar expression values across different experiments. If two genes
within the same operon are split between the training and test sets during cross-validation, then it
will be very easy to recognize that the one in the test set has the same label as the one in the training set,
which will artificially increase the accuracy of the method. Hence in this case it is important to
make sure that, during the random split into k subsets, all genes within an operon belong to the
same fold. The curve named SIRENE in Figure [6] has been obtained with this unbiased procedure.
The large gap between the two curves shows how strong the bias induced by
splitting operons in the cross-validation procedure can be.
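
An operon-aware split of this kind can be sketched as follows. The gene and operon names are hypothetical placeholders and the round-robin assignment is one simple choice among many; this is not the procedure of [27], only an illustration of keeping each group within a single fold.

```python
import random
from collections import defaultdict

def operon_folds(gene_to_operon, k, seed=0):
    """Split genes into k folds so that genes of one operon share a fold."""
    operons = defaultdict(list)
    for gene, operon in gene_to_operon.items():
        operons[operon].append(gene)
    # Shuffle whole operons, not individual genes.
    operon_ids = sorted(operons)
    random.Random(seed).shuffle(operon_ids)
    folds = [[] for _ in range(k)]
    for i, operon in enumerate(operon_ids):
        folds[i % k].extend(sorted(operons[operon]))
    return folds

# Hypothetical annotation: gene name -> operon identifier.
gene_to_operon = {
    "g1": "opA", "g2": "opA", "g3": "opA",
    "g4": "opB", "g5": "opB",
    "g6": "opC", "g7": "opD", "g8": "opE", "g9": "opE",
}
folds = operon_folds(gene_to_operon, k=3)
fold_of = {gene: i for i, fold in enumerate(folds) for gene in fold}
```

Because operons differ in size, the folds are not perfectly balanced in number of genes; that is the cost of never splitting an operon between training and test sets.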

4 Discussion 

We reviewed several strategies to cast the problem of graph inference as a classical supervised 
classification problem, which can be solved by virtually any pattern recognition algorithm. Contrary 
to de novo approaches, these strategies assume that a set of edges is already known and use the 
data available about vertices and known edges to infer missing edges. On several experiments 
involving the inference of metabolic, PPI and regulatory networks from a variety of genomic data, 
these methods were shown to give good results compared to state-of-the-art de novo methods, and 
a particular implementation of this strategy (the local model) consistently gave very good results 
on all datasets. 

In a sense the superiority of supervised methods over de novo methods observed in the
experiments is not surprising, because supervised methods use more information. As in many real-world
applications this additional information is available, it suggests that supervised methods may be a 
better choice than de novo ones in many cases. It should be pointed out, though, that some of the 
methods we classified as de novo, like for example Bayesian networks, could easily be adapted to the 
supervised inference scenario by putting constraints or prior distribution on the graph to be inferred. 
On the other hand, the strength of supervised methods depends critically on the availability of a 
good training set, which may not be available in some situations, such as inferring the structure of 
smaller graphs. 

We observe that there is not a single way to cast the problem as a binary classification problem, 
which suggests that further research is needed to design optimally adapted methods. In particular, 
the local method, which performs best in the 3 benchmark experiments, has obvious limitations, 
such as its inability to infer new edges for vertices with no edge already known. The development 
of new strategies that keep the performance of the local methods for vertices with enough known 
edges, but borrow some ideas from, e.g., the global models of Section 2.5, to be able to infer edges
for vertices with few or no known edges, is thus a promising research direction.